A significant share of hotel bookings are lost to cancellations or no-shows. Typical reasons for cancelling include changes of plans and scheduling conflicts, and cancelling is often made easier by free-of-charge or low-cost cancellation policies. This is convenient for guests, but for hotels it is a less desirable and potentially revenue-diminishing factor to deal with. The losses are particularly high for last-minute cancellations.
New technologies, particularly online booking channels, have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of handling cancellations, which are no longer driven by traditional booking and guest characteristics alone.
Booking cancellations impact a hotel on various fronts: revenue is lost and is hard to recoup by reselling the room (often at a lower price), distribution channels charge additional commissions, and staff time goes into making the related arrangements.
The increasing number of cancellations calls for a machine-learning solution that can predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As the data scientist on the project, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.
The data contains various attributes of customers' bookings. The detailed data dictionary is given below.
Data Dictionary

- Booking_ID: unique identifier of each booking
- no_of_adults: number of adults
- no_of_children: number of children
- no_of_weekend_nights: number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: type of meal plan booked by the customer
- required_car_parking_space: whether the customer required a car parking space (0 - No, 1 - Yes)
- room_type_reserved: type of room reserved by the customer (encoded by INN Hotels)
- lead_time: number of days between the date of booking and the arrival date
- arrival_year, arrival_month, arrival_date: year, month, and day of the arrival date
- market_segment_type: market segment designation
- repeated_guest: whether the customer is a repeated guest (0 - No, 1 - Yes)
- no_of_previous_cancellations: number of previous bookings canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: average price per day of the booking (in euros)
- no_of_special_requests: total number of special requests made by the customer (e.g., high floor, view from the room)
- booking_status: whether the booking was canceled or not
# importing necessary libraries
%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
# import dataset
hotel = pd.read_csv("C:/Users/USER/Downloads/INNHotelsGroup.csv")
data = (
hotel.copy()
) # copy data into new variable to avoid any changes to the original copy
data.head() # view first five rows of the dataset
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
data.tail() # view last five rows of the dataset
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80000 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95000 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39000 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67000 | 0 | Not_Canceled |
data.shape # the number of rows and columns in the dataset
(36275, 19)
data.info() # concise summary of the columns of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
data.duplicated().sum() # Check for duplicate values
0
data.isnull().sum() # check for missing values
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
data.nunique() # number of unique values in the columns of the dataset.
Booking_ID                              36275
no_of_adults                                5
no_of_children                              6
no_of_weekend_nights                        8
no_of_week_nights                          18
type_of_meal_plan                           4
required_car_parking_space                  2
room_type_reserved                          7
lead_time                                 352
arrival_year                                2
arrival_month                              12
arrival_date                               31
market_segment_type                         5
repeated_guest                              2
no_of_previous_cancellations                9
no_of_previous_bookings_not_canceled       59
avg_price_per_room                       3930
no_of_special_requests                      6
booking_status                              2
dtype: int64
data = data.drop(["Booking_ID"], axis=1)
data.describe().T # statistical summary of the dataset
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.00000 | 1.84496 | 0.51871 | 0.00000 | 2.00000 | 2.00000 | 2.00000 | 4.00000 |
| no_of_children | 36275.00000 | 0.10528 | 0.40265 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 10.00000 |
| no_of_weekend_nights | 36275.00000 | 0.81072 | 0.87064 | 0.00000 | 0.00000 | 1.00000 | 2.00000 | 7.00000 |
| no_of_week_nights | 36275.00000 | 2.20430 | 1.41090 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 17.00000 |
| required_car_parking_space | 36275.00000 | 0.03099 | 0.17328 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| lead_time | 36275.00000 | 85.23256 | 85.93082 | 0.00000 | 17.00000 | 57.00000 | 126.00000 | 443.00000 |
| arrival_year | 36275.00000 | 2017.82043 | 0.38384 | 2017.00000 | 2018.00000 | 2018.00000 | 2018.00000 | 2018.00000 |
| arrival_month | 36275.00000 | 7.42365 | 3.06989 | 1.00000 | 5.00000 | 8.00000 | 10.00000 | 12.00000 |
| arrival_date | 36275.00000 | 15.59700 | 8.74045 | 1.00000 | 8.00000 | 16.00000 | 23.00000 | 31.00000 |
| repeated_guest | 36275.00000 | 0.02564 | 0.15805 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| no_of_previous_cancellations | 36275.00000 | 0.02335 | 0.36833 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 13.00000 |
| no_of_previous_bookings_not_canceled | 36275.00000 | 0.15341 | 1.75417 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.00000 |
| avg_price_per_room | 36275.00000 | 103.42354 | 35.08942 | 0.00000 | 80.30000 | 99.45000 | 120.00000 | 540.00000 |
| no_of_special_requests | 36275.00000 | 0.61966 | 0.78624 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 5.00000 |
# Function needed to perform exploratory data analysis using histogram and boxplot
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; "auto" lets seaborn pick the bins when none are specified
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Function to create a labeled barplot showing counts or percentages per category
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot distributions with respect to the target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
# Observation on number of adults
labeled_barplot(data, "no_of_adults", perc=True)
# Observation on number of children
labeled_barplot(data, "no_of_children", perc=True)
# replacing the anomalous values of 9 and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
# Observation on number of week nights
labeled_barplot(data, "no_of_week_nights", perc=True)
# Observation on number of weekend nights
labeled_barplot(data, "no_of_weekend_nights", perc=True)
# Observation on car parking space required
labeled_barplot(data, "required_car_parking_space", perc=True)
# Observation on type of meal plan
labeled_barplot(data, "type_of_meal_plan", perc=True)
# Observation on room type reserved
labeled_barplot(data, "room_type_reserved", perc=True)
# Observation on arrival month
labeled_barplot(data, "arrival_month", perc=True)
# Observation on market segment type
labeled_barplot(data, "market_segment_type", perc=True)
# Observation on number of special requests
labeled_barplot(data, "no_of_special_requests", perc=True)
# Observation on booking status
labeled_barplot(data, "booking_status", perc=True)
# encode canceled bookings to 1 and not canceled to 0
data["booking_status"] = data["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
histogram_boxplot(data, "lead_time") # observation on lead time
histogram_boxplot(data, "avg_price_per_room") # observation on average price per room
data[
data["avg_price_per_room"] == 0
] # filter rows with average price per room equal to zero
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 2 | 2017 | 9 | 10 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | 0 |
| 145 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 13 | 2018 | 6 | 1 | Complementary | 1 | 3 | 5 | 0.00000 | 1 | 0 |
| 209 | 1 | 0 | 0 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2018 | 2 | 27 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | 0 |
| 266 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2017 | 8 | 12 | Complementary | 1 | 0 | 1 | 0.00000 | 1 | 0 |
| 267 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2017 | 8 | 23 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35983 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 6 | 7 | Complementary | 1 | 4 | 17 | 0.00000 | 1 | 0 |
| 36080 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 3 | 21 | Complementary | 1 | 3 | 15 | 0.00000 | 1 | 0 |
| 36114 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 3 | 2 | Online | 0 | 0 | 0 | 0.00000 | 0 | 0 |
| 36217 | 2 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 2 | 3 | 2017 | 8 | 9 | Online | 0 | 0 | 0 | 0.00000 | 2 | 0 |
| 36250 | 1 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 6 | 2017 | 12 | 10 | Online | 0 | 0 | 0 | 0.00000 | 0 | 0 |
545 rows × 18 columns
data.loc[
data["avg_price_per_room"] == 0, "market_segment_type"
].value_counts() # show the count of rows with average price per room equal to zero and market segment type
Complementary    354
Online           191
Name: market_segment_type, dtype: int64
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(
0.25
) # calculate 25th quantile for average price per room
# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(
0.75
) # calculate 75th quantile for average price per room
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
print("The value of the upper whisker is", Upper_Whisker)
The value of the upper whisker is 179.55
# capping the extreme outliers (avg_price_per_room >= 500) at the upper-whisker value
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
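The Q1/Q3/IQR steps above generalize to any numeric column. A minimal reusable sketch (the `cap_upper_outliers` helper and the toy price series are illustrative, not part of the notebook):

```python
import pandas as pd

def cap_upper_outliers(series, whis=1.5):
    """Cap values above Q3 + whis * IQR at the upper whisker (Tukey's rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    upper = q3 + whis * (q3 - q1)
    return series.clip(upper=upper), upper

# toy data: one extreme price among ordinary ones
prices = pd.Series([80, 90, 100, 110, 120, 900])
capped, whisker = cap_upper_outliers(prices)  # whisker = 155.0; 900 is capped at 155.0
```

Note the notebook only caps values above a hard floor of 500 rather than everything above the whisker; that is a judgment call to preserve most of the price distribution.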
Hotel rates are dynamic and change with demand and customer demographics. Let's see how prices vary across the different market segments.
sns.boxplot(data=data, y="avg_price_per_room", x="market_segment_type")
# boxplot on average price and market segment
histogram_boxplot(
data, "no_of_previous_cancellations"
) # observation on number of previous booking cancellations
histogram_boxplot(
data, "no_of_previous_bookings_not_canceled"
) # observation on number of previous bookings not canceled
cols_list = data.select_dtypes(
include=np.number
).columns.tolist() # create array of numerical type columns
plt.figure(figsize=(12, 7)) # size configurations of the plot
# Create heatmap of numerical columns
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()  # show plot
Let's see how booking status varies across market segments, and how the average price per room relates to booking status.
stacked_barplot(
data, "market_segment_type", "booking_status"
) # bar graph of market segment type and booking status
booking_status           0      1    All
market_segment_type
All                  24390  11885  36275
Online               14739   8475  23214
Offline               7375   3153  10528
Corporate             1797    220   2017
Aviation                88     37    125
Complementary          391      0    391
------------------------------------------------------------------------------------------------------------------------
plt.figure(figsize=(10, 6)) # size configurations
sns.boxplot(
data=data, x="booking_status", y="avg_price_per_room"
) # create a boxplot of the two variables
Many guests have special requirements when booking a hotel room. Let's see how these requests impact cancellations.
plt.figure(figsize=(10, 6)) # size configuration
sns.barplot(
data=data, x="booking_status", y="no_of_special_requests"
) # create a bar graph of the given variables
plt.figure(figsize=(10, 6)) # size configuration
sns.barplot(
data=data, x="no_of_special_requests", y="no_of_previous_cancellations",
) # create a bar graph of given variables
plt.figure(figsize=(10, 6)) # size configurations
sns.barplot(
data=data, x="no_of_special_requests", y="no_of_previous_bookings_not_canceled"
) # create a bar graph of the given variables
Let's see if the special requests made by customers impact the price of a room.
plt.figure(figsize=(10, 6)) # size configurations
sns.boxplot(
    data=data, x="no_of_special_requests", y="avg_price_per_room"
)  # boxplot of number of special requests vs. average price per room
plt.show() # plot graph
We saw earlier that there is a positive correlation between booking status and average price per room. Let's analyze it further.
distribution_plot_wrt_target(
data, "avg_price_per_room", "booking_status"
) # histogram and boxplot of average price per room and booking status
There is also a positive correlation between booking status and lead time. Let's analyze it further.
distribution_plot_wrt_target(
data, "lead_time", "booking_status"
) # histogram and boxplot of lead time and booking status
Generally, people travel with their spouses and children for vacations or other activities. Let's create a new dataframe of customers who traveled with their families and analyze the impact on booking status.
family_data = data[
    (data["no_of_children"] >= 0) & (data["no_of_adults"] > 1)
].copy()  # bookings with more than one adult (the no_of_children condition is always true); .copy() avoids SettingWithCopyWarning
family_data.shape # show number of rows and columns of the filtered dataset
(28441, 18)
family_data["no_of_family_members"] = (
family_data["no_of_adults"] + family_data["no_of_children"]
) # create column with total number of adults and children
plt.figure(figsize=(10, 6)) # size configurations
sns.barplot(data=family_data, x="booking_status", y="no_of_family_members")
# barplot of booking status and total family size
plt.figure(figsize=(10, 6)) # size configurations
sns.boxplot(data=family_data, x="booking_status", y="no_of_family_members")
# boxplot of booking status and total family size
A similar analysis for customers who stayed at least one week night and one weekend night at the hotel.
stay_data = data[
    (data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)
].copy()  # bookings with at least one week night and one weekend night; .copy() avoids SettingWithCopyWarning
stay_data.shape # number of rows and columns of dataset
(17094, 18)
stay_data["total_days"] = (
    stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
)  # create a column with the total number of nights stayed
plt.figure(figsize=(10, 6)) # size configurations
sns.barplot(data=stay_data, x="booking_status", y="total_days")
# bar graph of booking status and total days spent
stay_data["no_of_week_nights"].sum()  # total number of week nights booked
43443
stay_data["no_of_weekend_nights"].sum()  # total number of weekend nights booked
26313
Repeat guests are guests who stay at the hotel often and are important to brand equity. Let's see what percentage of repeat guests cancel.
labeled_barplot(
data, "repeated_guest", perc=True
) # bar graph of repeated guest feature
Let's find out which months are the busiest at the hotel.
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()
# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
{"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)
# plotting the trend over different months
plt.figure(figsize=(10, 5)) # size parameters
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show() # show graph
Let's check the percentage of bookings canceled in each month.
canceled_bookings_per_month = data[
(data["booking_status"] == 1)
] # filter canceled bookings per month
canceled_bookings_per_month["arrival_month"].value_counts(
    normalize=True
)  # relative share of cancellations by arrival month
10   0.15818
9    0.12941
8    0.12520
7    0.11056
6    0.10862
4    0.08372
5    0.07976
11   0.07362
3    0.05890
2    0.03618
12   0.03382
1    0.00202
Name: arrival_month, dtype: float64
labeled_barplot(
canceled_bookings_per_month, "arrival_month", perc=True
) # bar graph of arrival month
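Note that `value_counts(normalize=True)` gives each month's share of all cancellations, not the within-month cancellation rate; the latter comes from a groupby mean on the encoded target. A small sketch of the difference on made-up toy data:

```python
import pandas as pd

# toy bookings: month 1 has 3 bookings (1 canceled), month 2 has 2 (both canceled)
toy = pd.DataFrame(
    {"arrival_month": [1, 1, 1, 2, 2], "booking_status": [1, 0, 0, 1, 1]}
)

# share of all cancellations falling in each month (what value_counts shows)
share = toy.loc[toy["booking_status"] == 1, "arrival_month"].value_counts(
    normalize=True
)  # month 1: 1/3, month 2: 2/3

# within-month cancellation rate (fraction of that month's bookings canceled)
rate = toy.groupby("arrival_month")["booking_status"].mean()  # month 1: 1/3, month 2: 1.0
```

Both views are useful: the share tells you where cancellations concentrate in absolute terms, while the rate tells you how risky a booking in a given month is.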
As hotel room prices are dynamic, let's see how prices vary across the months.
plt.figure(figsize=(12, 8)) # size parameters
sns.lineplot(
data=data, x="arrival_month", y="avg_price_per_room"
) # create lineplot between average price per room and arrival month
plt.show() # show graph
plt.figure(figsize=(11, 8)) # size configurations
sns.boxplot(data=data, x="arrival_month", y="avg_price_per_room")
# boxplot of arrival month and average price per room
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")
plt.figure(figsize=(15, 12)) # size parameters
# generate outlier graphs for the various features
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
    # classifying observations as class 1 when the predicted probability exceeds the threshold
    pred = model.predict(predictors) > threshold
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
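As a sanity check on what this function reports, all four metrics derive directly from the confusion-matrix counts. A small hand-computed illustration on made-up labels (not model output):

```python
import numpy as np

# toy true labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives  = 3
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives = 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives = 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives  = 3

accuracy = (tp + tn) / len(y_true)                   # 0.75
recall = tp / (tp + fn)                              # 0.75
precision = tp / (tp + fp)                           # 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```

For this problem a false negative (a cancellation the model misses) costs the hotel a vacant room, so recall is the metric to prioritize alongside overall accuracy.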
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
X = data.drop(["booking_status"], axis=1) # drop dependent variable
Y = data["booking_status"] # assign dependent variable to variable Y
# adding constant
X = sm.add_constant(X) # add constant to X
X = pd.get_dummies(X, drop_first=True) # create dummies for X
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
) # split the data into train test in the ratio 70:30 with random_state = 1
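This split is purely random, which is why the class percentages in train and test can drift slightly; passing `stratify=Y` to scikit-learn's `train_test_split` would keep them equal. The underlying idea can be sketched in plain numpy (the `stratified_indices` helper is illustrative, not part of the notebook):

```python
import numpy as np

def stratified_indices(y, test_frac=0.30, seed=1):
    """Split indices so each class appears in the test set in proportion."""
    rng = np.random.default_rng(seed)
    test_mask = np.zeros(len(y), dtype=bool)
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)   # positions of this class
        rng.shuffle(cls_idx)
        n_test = int(round(test_frac * len(cls_idx)))
        test_mask[cls_idx[:n_test]] = True   # per-class 30% into the test set
    return ~test_mask, test_mask

# toy imbalanced labels: 70 zeros, 30 ones
y = np.array([0] * 70 + [1] * 30)
train_mask, test_mask = stratified_indices(y)
# test set gets exactly 21 zeros and 9 ones, preserving the 70:30 class ratio
```

In practice, `train_test_split(X, Y, test_size=0.30, random_state=1, stratify=Y)` does this in one call.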
X.head() # first five rows of the dataset
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.00000 | 2 | 0 | 1 | 2 | 0 | 224 | 2017 | 10 | 2 | 0 | 0 | 0 | 65.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1.00000 | 2 | 0 | 2 | 3 | 0 | 5 | 2018 | 11 | 6 | 0 | 0 | 0 | 106.68000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1.00000 | 1 | 0 | 2 | 1 | 0 | 1 | 2018 | 2 | 28 | 0 | 0 | 0 | 60.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1.00000 | 2 | 0 | 0 | 2 | 0 | 211 | 2018 | 5 | 20 | 0 | 0 | 0 | 100.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1.00000 | 2 | 0 | 1 | 1 | 0 | 48 | 2018 | 4 | 11 | 0 | 0 | 0 | 94.50000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# print statements of the number of rows and columns in train and test data set and their relative percentages.
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
0   0.67064
1   0.32936
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.67638
1   0.32362
Name: booking_status, dtype: float64
X_train.info() # concise summary of dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25392 entries, 13662 to 33003
Data columns (total 28 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   const                                 25392 non-null  float64
 1   no_of_adults                          25392 non-null  int64
 2   no_of_children                        25392 non-null  int64
 3   no_of_weekend_nights                  25392 non-null  int64
 4   no_of_week_nights                     25392 non-null  int64
 5   required_car_parking_space            25392 non-null  int64
 6   lead_time                             25392 non-null  int64
 7   arrival_year                          25392 non-null  int64
 8   arrival_month                         25392 non-null  int64
 9   arrival_date                          25392 non-null  int64
 10  repeated_guest                        25392 non-null  int64
 11  no_of_previous_cancellations          25392 non-null  int64
 12  no_of_previous_bookings_not_canceled  25392 non-null  int64
 13  avg_price_per_room                    25392 non-null  float64
 14  no_of_special_requests                25392 non-null  int64
 15  type_of_meal_plan_Meal Plan 2         25392 non-null  uint8
 16  type_of_meal_plan_Meal Plan 3         25392 non-null  uint8
 17  type_of_meal_plan_Not Selected        25392 non-null  uint8
 18  room_type_reserved_Room_Type 2        25392 non-null  uint8
 19  room_type_reserved_Room_Type 3        25392 non-null  uint8
 20  room_type_reserved_Room_Type 4        25392 non-null  uint8
 21  room_type_reserved_Room_Type 5        25392 non-null  uint8
 22  room_type_reserved_Room_Type 6        25392 non-null  uint8
 23  room_type_reserved_Room_Type 7        25392 non-null  uint8
 24  market_segment_type_Complementary     25392 non-null  uint8
 25  market_segment_type_Corporate         25392 non-null  uint8
 26  market_segment_type_Offline           25392 non-null  uint8
 27  market_segment_type_Online            25392 non-null  uint8
dtypes: float64(2), int64(13), uint8(13)
memory usage: 3.4 MB
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit() # fit logistic regression
print(lg.summary()) # print summary of the model
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.425090
Iterations: 35
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Fri, 27 Jan 2023 Pseudo R-squ.: 0.3292
Time: 15:39:33 Log-Likelihood: -10794.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -922.8266 120.832 -7.637 0.000 -1159.653 -686.000
no_of_adults 0.1137 0.038 3.019 0.003 0.040 0.188
no_of_children 0.1580 0.062 2.544 0.011 0.036 0.280
no_of_weekend_nights 0.1067 0.020 5.395 0.000 0.068 0.145
no_of_week_nights 0.0397 0.012 3.235 0.001 0.016 0.064
required_car_parking_space -1.5943 0.138 -11.565 0.000 -1.865 -1.324
lead_time 0.0157 0.000 58.863 0.000 0.015 0.016
arrival_year 0.4561 0.060 7.617 0.000 0.339 0.573
arrival_month -0.0417 0.006 -6.441 0.000 -0.054 -0.029
arrival_date 0.0005 0.002 0.259 0.796 -0.003 0.004
repeated_guest -2.3472 0.617 -3.806 0.000 -3.556 -1.139
no_of_previous_cancellations 0.2664 0.086 3.108 0.002 0.098 0.434
no_of_previous_bookings_not_canceled -0.1727 0.153 -1.131 0.258 -0.472 0.127
avg_price_per_room 0.0188 0.001 25.396 0.000 0.017 0.020
no_of_special_requests -1.4689 0.030 -48.782 0.000 -1.528 -1.410
type_of_meal_plan_Meal Plan 2 0.1756 0.067 2.636 0.008 0.045 0.306
type_of_meal_plan_Meal Plan 3 17.3584 3987.836 0.004 0.997 -7798.656 7833.373
type_of_meal_plan_Not Selected 0.2784 0.053 5.247 0.000 0.174 0.382
room_type_reserved_Room_Type 2 -0.3605 0.131 -2.748 0.006 -0.618 -0.103
room_type_reserved_Room_Type 3 -0.0012 1.310 -0.001 0.999 -2.568 2.566
room_type_reserved_Room_Type 4 -0.2823 0.053 -5.304 0.000 -0.387 -0.178
room_type_reserved_Room_Type 5 -0.7189 0.209 -3.438 0.001 -1.129 -0.309
room_type_reserved_Room_Type 6 -0.9501 0.151 -6.274 0.000 -1.247 -0.653
room_type_reserved_Room_Type 7 -1.4003 0.294 -4.770 0.000 -1.976 -0.825
market_segment_type_Complementary -40.5975 5.65e+05 -7.19e-05 1.000 -1.11e+06 1.11e+06
market_segment_type_Corporate -1.1924 0.266 -4.483 0.000 -1.714 -0.671
market_segment_type_Offline -2.1946 0.255 -8.621 0.000 -2.694 -1.696
market_segment_type_Online -0.3995 0.251 -1.590 0.112 -0.892 0.093
========================================================================================================
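The non-convergence warning above and the extreme coefficients with huge standard errors for `type_of_meal_plan_Meal Plan 3` and `market_segment_type_Complementary` are classic signs of quasi-complete separation: those levels are rare and fall almost entirely into one outcome class, so their maximum-likelihood coefficients diverge. A minimal sketch of how to spot this with a cross-tabulation (hypothetical toy data, not the hotel bookings):

```python
import pandas as pd

# Toy illustration of quasi-complete separation: a rare dummy whose
# positive level maps entirely to one outcome class.
df = pd.DataFrame(
    {
        "rare_flag": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        "target":    [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
    }
)

# Cross-tabulate the dummy against the target; a zero cell signals separation.
ct = pd.crosstab(df["rare_flag"], df["target"])
print(ct)
# Every rare_flag == 1 row falls in target == 0, so the MLE for that
# coefficient diverges and its standard error blows up.
```

Dropping or merging such sparse levels (or regularizing) typically restores convergence.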
print("Training performance:") # print this statement
model_performance_classification_statsmodels(
lg, X_train, y_train
) # show performance of model on training data set
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80600 | 0.63410 | 0.73971 | 0.68285 |
print("Test performance:")  # print statement
model_performance_classification_statsmodels(
    lg, X_test, y_test
)  # show performance of model on test data set
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80493 | 0.63260 | 0.72882 | 0.67731 |
The test-set metrics are very close to the training metrics, indicating the model generalizes well rather than overfitting.
# create boxplot and histogram for numeric columns
for col in [
    "no_of_adults",
    "no_of_children",
    "no_of_weekend_nights",
    "no_of_week_nights",
    "lead_time",
    "no_of_previous_cancellations",
    "no_of_previous_bookings_not_canceled",
    "avg_price_per_room",
    "no_of_special_requests",
    "booking_status",
]:
    histogram_boxplot(data, col)  # create histogram and boxplot
for col in [
    "required_car_parking_space",
    "arrival_year",
    "arrival_month",
    "arrival_date",
    "repeated_guest",
    "type_of_meal_plan",
    "room_type_reserved",
    "market_segment_type",
]:
    labeled_barplot(data, col, perc=True)  # create bar graph for the above features
sns.pairplot(data, hue="booking_status")
# pair plot for all variables in the dataset; sns.pairplot creates its own
# figure, so a separate plt.figure() call is not needed
<seaborn.axisgrid.PairGrid at 0x28654f6b9a0>
# function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
checking_vif(X_train) # display results of function
| feature | VIF | |
|---|---|---|
| 0 | const | 39497686.20788 |
| 1 | no_of_adults | 1.35113 |
| 2 | no_of_children | 2.09358 |
| 3 | no_of_weekend_nights | 1.06948 |
| 4 | no_of_week_nights | 1.09571 |
| 5 | required_car_parking_space | 1.03997 |
| 6 | lead_time | 1.39517 |
| 7 | arrival_year | 1.43190 |
| 8 | arrival_month | 1.27633 |
| 9 | arrival_date | 1.00679 |
| 10 | repeated_guest | 1.78358 |
| 11 | no_of_previous_cancellations | 1.39569 |
| 12 | no_of_previous_bookings_not_canceled | 1.65200 |
| 13 | avg_price_per_room | 2.06860 |
| 14 | no_of_special_requests | 1.24798 |
| 15 | type_of_meal_plan_Meal Plan 2 | 1.27328 |
| 16 | type_of_meal_plan_Meal Plan 3 | 1.02526 |
| 17 | type_of_meal_plan_Not Selected | 1.27306 |
| 18 | room_type_reserved_Room_Type 2 | 1.10595 |
| 19 | room_type_reserved_Room_Type 3 | 1.00330 |
| 20 | room_type_reserved_Room_Type 4 | 1.36361 |
| 21 | room_type_reserved_Room_Type 5 | 1.02800 |
| 22 | room_type_reserved_Room_Type 6 | 2.05614 |
| 23 | room_type_reserved_Room_Type 7 | 1.11816 |
| 24 | market_segment_type_Complementary | 4.50276 |
| 25 | market_segment_type_Corporate | 16.92829 |
| 26 | market_segment_type_Offline | 64.11564 |
| 27 | market_segment_type_Online | 71.18026 |
# initial list of columns
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train[cols]
    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)  # print selected features to be used
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
X_train1 = X_train[selected_features] # selected features on train data set
X_test1 = X_test[selected_features] # selected features on test data set
logit1 = sm.Logit(
y_train, X_train1.astype(float)
) # train logistic regression on X_train1 and y_train
lg1 = logit1.fit(disp=False) # fit logistic regression
print(lg1.summary()) # print summary of the model
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Fri, 27 Jan 2023 Pseudo R-squ.: 0.3282
Time: 15:45:21 Log-Likelihood: -10810.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -915.6391 120.471 -7.600 0.000 -1151.758 -679.520
no_of_adults 0.1088 0.037 2.914 0.004 0.036 0.182
no_of_children 0.1531 0.062 2.470 0.014 0.032 0.275
no_of_weekend_nights 0.1086 0.020 5.498 0.000 0.070 0.147
no_of_week_nights 0.0417 0.012 3.399 0.001 0.018 0.066
required_car_parking_space -1.5947 0.138 -11.564 0.000 -1.865 -1.324
lead_time 0.0157 0.000 59.213 0.000 0.015 0.016
arrival_year 0.4523 0.060 7.576 0.000 0.335 0.569
arrival_month -0.0425 0.006 -6.591 0.000 -0.055 -0.030
repeated_guest -2.7367 0.557 -4.916 0.000 -3.828 -1.646
no_of_previous_cancellations 0.2288 0.077 2.983 0.003 0.078 0.379
avg_price_per_room 0.0192 0.001 26.336 0.000 0.018 0.021
no_of_special_requests -1.4698 0.030 -48.884 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1642 0.067 2.469 0.014 0.034 0.295
type_of_meal_plan_Not Selected 0.2860 0.053 5.406 0.000 0.182 0.390
room_type_reserved_Room_Type 2 -0.3552 0.131 -2.709 0.007 -0.612 -0.098
room_type_reserved_Room_Type 4 -0.2828 0.053 -5.330 0.000 -0.387 -0.179
room_type_reserved_Room_Type 5 -0.7364 0.208 -3.535 0.000 -1.145 -0.328
room_type_reserved_Room_Type 6 -0.9682 0.151 -6.403 0.000 -1.265 -0.672
room_type_reserved_Room_Type 7 -1.4343 0.293 -4.892 0.000 -2.009 -0.860
market_segment_type_Corporate -0.7913 0.103 -7.692 0.000 -0.993 -0.590
market_segment_type_Offline -1.7854 0.052 -34.363 0.000 -1.887 -1.684
==================================================================================================
print("Training performance:") # print statement
model_performance_classification_statsmodels(
lg1, X_train1, y_train
) # check performance on X_train1 and y_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80545 | 0.63267 | 0.73907 | 0.68174 |
# converting coefficients to odds
odds = np.exp(lg1.params)
# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns)
| Odds | Change_odd% | |
|---|---|---|
| const | 0.00000 | -100.00000 |
| no_of_adults | 1.11491 | 11.49096 |
| no_of_children | 1.16546 | 16.54593 |
| no_of_weekend_nights | 1.11470 | 11.46966 |
| no_of_week_nights | 1.04258 | 4.25841 |
| required_car_parking_space | 0.20296 | -79.70395 |
| lead_time | 1.01583 | 1.58331 |
| arrival_year | 1.57195 | 57.19508 |
| arrival_month | 0.95839 | -4.16120 |
| repeated_guest | 0.06478 | -93.52180 |
| no_of_previous_cancellations | 1.25712 | 25.71181 |
| avg_price_per_room | 1.01937 | 1.93684 |
| no_of_special_requests | 0.22996 | -77.00374 |
| type_of_meal_plan_Meal Plan 2 | 1.17846 | 17.84641 |
| type_of_meal_plan_Not Selected | 1.33109 | 33.10947 |
| room_type_reserved_Room_Type 2 | 0.70104 | -29.89588 |
| room_type_reserved_Room_Type 4 | 0.75364 | -24.63551 |
| room_type_reserved_Room_Type 5 | 0.47885 | -52.11548 |
| room_type_reserved_Room_Type 6 | 0.37977 | -62.02290 |
| room_type_reserved_Room_Type 7 | 0.23827 | -76.17294 |
| market_segment_type_Corporate | 0.45326 | -54.67373 |
| market_segment_type_Offline | 0.16773 | -83.22724 |
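As a sanity check on the odds interpretation, the arithmetic behind a row of the table can be reproduced directly. For example, the `lead_time` coefficient of roughly 0.0157 from the summary above implies about a 1.6% increase in cancellation odds per extra day, and odds effects compound multiplicatively over longer lead times:

```python
import numpy as np

beta_lead_time = 0.0157  # lead_time coefficient from the fitted model above

# One extra day of lead time multiplies the odds of cancellation by e^beta
odds_per_day = np.exp(beta_lead_time)
pct_change = (odds_per_day - 1) * 100  # ~1.58% higher odds per day

# Effects compound multiplicatively: e.g. 30 extra days of lead time
odds_30_days = np.exp(beta_lead_time * 30)  # ~1.6x the odds
```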
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train)
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg1, X_train1, y_train
) # performance on X_train1 and y_train
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80545 | 0.63267 | 0.73907 | 0.68174 |
# parameters to plot roc-auc graph
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc) # show optimal threshold value
0.3700522558707844
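The `np.argmax(tpr - fpr)` step above is Youden's J statistic: it selects the ROC point farthest above the chance diagonal. A self-contained sketch on small hypothetical scores (not the hotel data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities to illustrate the idea
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.35, 0.6, 0.4, 0.55, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J = TPR - FPR; the threshold maximizing it balances
# sensitivity against the false-positive rate
j_scores = tpr - fpr
best_threshold = thresholds[np.argmax(j_scores)]
print(best_threshold)
```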
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
) # confusion matrix for X_train1 and y_train with optimal_threshold_auc_roc as threshold
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc # performance on train data set
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.79265 | 0.73622 | 0.66808 | 0.70049 |
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(
y_train, y_scores,
) # axes for precision-recall curve
# parameters to draw curve
def plot_prec_recall_vs_thresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_thresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve,
) # confusion matrix for X_train1 and y_train with optimal_threshold_curve as threshold
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve # performance on train data set
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80132 | 0.69939 | 0.69797 | 0.69868 |
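The 0.42 cutoff above was read off the precision-recall plot by eye; the crossover can also be located programmatically as the threshold where precision and recall are closest to each other. A sketch on hypothetical scores (in the notebook, `y_train` and `lg1.predict(X_train1)` would play these roles):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([0.05, 0.2, 0.3, 0.4, 0.45, 0.6, 0.65, 0.7, 0.8, 0.9])

prec, rec, thr = precision_recall_curve(y_true, y_prob)

# Instead of reading the crossover off the plot, pick the threshold
# where precision and recall are closest
idx = np.argmin(np.abs(prec[:-1] - rec[:-1]))
balanced_threshold = thr[idx]
print(balanced_threshold)
```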
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold | |
|---|---|---|---|
| Accuracy | 0.80545 | 0.79265 | 0.80132 |
| Recall | 0.63267 | 0.73622 | 0.69939 |
| Precision | 0.73907 | 0.66808 | 0.69797 |
| F1 | 0.68174 | 0.70049 | 0.69868 |
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, X_test1, y_test
)  # create confusion matrix for X_test1 and y_test with the default threshold
log_reg_model_test_perf = model_performance_classification_statsmodels(
    lg1, X_test1, y_test
)  # check performance on X_test1 and y_test with the default threshold
print("Test performance:") # print statement
log_reg_model_test_perf # performance on test data
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.79555 | 0.73964 | 0.66573 | 0.70074 |
# parameters to draw curve
logit_roc_auc_train = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
) # create confusion matrix for X_test1 and y_test using optimal_threshold_auc_roc as threshold
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc # performance on test data
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.79555 | 0.73964 | 0.66573 | 0.70074 |
Using model with threshold=0.42
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_curve
) # create confusion matrix for X_test1 and y_test using optimal_threshold_curve as threshold
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve # performance on test data
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80345 | 0.70358 | 0.69353 | 0.69852 |
Model Performance Summary
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold | |
|---|---|---|---|
| Accuracy | 0.80545 | 0.79265 | 0.80132 |
| Recall | 0.63267 | 0.73622 | 0.69939 |
| Precision | 0.73907 | 0.66808 | 0.69797 |
| F1 | 0.68174 | 0.70049 | 0.69868 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold | |
|---|---|---|---|
| Accuracy | 0.79555 | 0.79555 | 0.80345 |
| Recall | 0.73964 | 0.73964 | 0.70358 |
| Precision | 0.66573 | 0.66573 | 0.69353 |
| F1 | 0.70074 | 0.70074 | 0.69852 |
X = data.drop(["booking_status"], axis=1) # drop dependent variable from dataset
Y = data["booking_status"] # assign variable Y to dependent variable
X = pd.get_dummies(X, drop_first=True) # create dummies for X
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
) # split the data into train test in the ratio 70:30 with random_state = 1
print(
"Shape of Training set : ", X_train.shape
) # number of rows and columns of training set
print("Shape of test set : ", X_test.shape) # number of rows and columns of testing set
print("Percentage of classes in training set:") # print statement
print(y_train.value_counts(normalize=True)) # ratio of both classes in training set
print("Percentage of classes in test set:") # print statement
print(y_test.value_counts(normalize=True)) # ratio of both classes in test set
Shape of Training set :  (25392, 27)
Shape of test set :  (10883, 27)
Percentage of classes in training set:
0   0.67064
1   0.32936
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.67638
1   0.32362
Name: booking_status, dtype: float64
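The class ratios differ slightly between the training set (32.94% canceled) and the test set (32.36%). Passing `stratify=Y` to `train_test_split` would keep them identical; a small sketch on hypothetical data with a similar imbalance:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target (~67%/33%) to illustrate stratified splitting
y = pd.Series([0] * 670 + [1] * 330)
X = pd.DataFrame({"feature": np.arange(len(y))})

# stratify=y keeps the class ratio identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```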
X_train.head() # first five rows of training data set
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 1 | 0 | 0 | 1 | 0 | 163 | 2018 | 10 | 15 | 0 | 0 | 0 | 115.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 26641 | 2 | 0 | 0 | 3 | 0 | 113 | 2018 | 3 | 31 | 0 | 0 | 0 | 78.15000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17835 | 2 | 0 | 2 | 3 | 0 | 359 | 2018 | 10 | 14 | 0 | 0 | 0 | 78.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 21485 | 2 | 0 | 0 | 3 | 0 | 136 | 2018 | 6 | 29 | 0 | 0 | 0 | 85.50000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5670 | 2 | 0 | 1 | 2 | 0 | 21 | 2018 | 8 | 15 | 0 | 0 | 0 | 151.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
model = DecisionTreeClassifier(random_state=1) # create decision tree
model.fit(X_train, y_train)  # fit decision tree on train data
DecisionTreeClassifier(random_state=1)
confusion_matrix_sklearn(model, X_train, y_train) # create confusion matrix
decision_tree_perf_train_without = model_performance_classification_sklearn(
model, X_train, y_train
) # performance of decision tree on training data
decision_tree_perf_train_without
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99421 | 0.98661 | 0.99578 | 0.99117 |
confusion_matrix_sklearn(model, X_test, y_test) # create confusion matrix
decision_tree_perf_test_without = model_performance_classification_sklearn(
model, X_test, y_test
) # performance of decision tree on test data
decision_tree_perf_test_without
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87118 | 0.81175 | 0.79461 | 0.80309 |
model = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model.fit(X_train, y_train) # fit model on train data using the above set parameters
DecisionTreeClassifier(class_weight='balanced', random_state=1)
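With `class_weight="balanced"`, each class is weighted by `n_samples / (n_classes * count_c)`, so given the ~67/33 split above, the minority (canceled) class counts roughly twice as much as the majority class in the loss. The arithmetic, sketched with hypothetical counts mirroring that split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts roughly matching the 67%/33% training split above
y = np.array([0] * 670 + [1] * 330)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
# weight_c = n_samples / (n_classes * count_c); here the minority class
# ends up weighted about 2x the majority class
print(weights)
```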
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train # performance on train data
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99311 | 0.99510 | 0.98415 | 0.98960 |
confusion_matrix_sklearn(model, X_test, y_test) # create confusion matrix
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test # performance on test data
)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.86621 | 0.80494 | 0.78663 | 0.79568 |
feature_names = list(X_train.columns) # train data columns
importances = model.feature_importances_ # show their relative importances
indices = np.argsort(importances)
# parameters to draw graph
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
min_samples_split=10, random_state=1)
confusion_matrix_sklearn(estimator, X_train, y_train) # create confusion matrix
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
) # performance on train set
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.83097 | 0.78608 | 0.72425 | 0.75390 |
confusion_matrix_sklearn(estimator, X_test, y_test) # create confusion matrix
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
) # performance on test set
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.83497 | 0.78336 | 0.72758 | 0.75444 |
feature_names = list(X_train.columns) # train data columns
importances = estimator.feature_importances_
indices = np.argsort(importances)
# parameters to draw decision tree
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 133.59] class: 0
|   |   |   |   |   |--- avg_price_per_room > 196.50
|   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time > 68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room > 93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights > 1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month > 1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room > 99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time > 13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room > 71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected > 0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time > 102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights > 2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time > 4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time > 8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights > 3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month > 8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time > 163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room > 2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults > 1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room > 82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults > 2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time > 159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline > 0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time > 348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month > 11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date > 24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
# graph parameters
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(
    random_state=1, class_weight="balanced"
)  # base tree used to compute the cost complexity pruning path
path = clf.cost_complexity_pruning_path(
    X_train, y_train
)  # cost complexity pruning path on the train data
# abs() guards against tiny negative alphas caused by floating-point error
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path) # output of cost complexity computation
|  | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00000 | 0.00838 |
| 1 | 0.00000 | 0.00838 |
| 2 | 0.00000 | 0.00838 |
| 3 | 0.00000 | 0.00838 |
| 4 | 0.00000 | 0.00838 |
| ... | ... | ... |
| 1839 | 0.00890 | 0.32806 |
| 1840 | 0.00980 | 0.33786 |
| 1841 | 0.01272 | 0.35058 |
| 1842 | 0.03412 | 0.41882 |
| 1843 | 0.08118 | 0.50000 |
1844 rows × 2 columns
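As a side note, cost_complexity_pruning_path yields one effective alpha per candidate pruning step, which is why the table above runs to 1844 rows. The sketch below (synthetic data from make_classification, a hypothetical stand-in for the hotel dataset) illustrates the two properties the rest of this section relies on: the alphas come back sorted in increasing order, and the total leaf impurity grows as alpha grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hotel data (illustration only)
X, y = make_classification(n_samples=500, n_features=8, random_state=1)

clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X, y)

# Effective alphas are sorted ascending; pruning more (larger alpha)
# can only increase the total impurity of the leaves
assert np.all(np.diff(path.ccp_alphas) >= 0)
assert np.all(np.diff(path.impurities) >= 0)
print(len(path.ccp_alphas), "candidate alphas")
```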
# parameters to draw graph for total impurity vs effective alpha for training data
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree for each of the effective alphas. The last value
in ccp_alphas is the alpha that prunes the whole tree, leaving the final
tree, clfs[-1], with a single node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)  # fit decision tree on training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.0811791438913696
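As a quick sanity check (on synthetic make_classification data, not the hotel dataset), refitting a tree with ccp_alpha set to the largest effective alpha reproduces this behavior: every split below the root is pruned away and only a single node remains.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# The largest effective alpha prunes every split below the root
stump = DecisionTreeClassifier(random_state=1, ccp_alpha=path.ccp_alphas[-1])
stump.fit(X, y)
print(stump.tree_.node_count)  # prints 1
```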
clfs = clfs[:-1]  # drop the single-node tree
ccp_alphas = ccp_alphas[:-1]  # and the alpha that produced it
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)

f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5)) # size configurations
ax.set_xlabel("alpha") # label of x-axis
ax.set_ylabel("F1 Score") # label of y axis
ax.set_title("F1 Score vs alpha for training and testing sets") # title of plot
ax.plot(
    ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post"
)  # parameters for plot on train data
ax.plot(
    ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post"
)  # parameters for plot on test data
ax.legend() # plot legend in the graph
plt.show() # show graph
# Select the model with the highest F1 score on the test data
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012267633155167043,
class_weight='balanced', random_state=1)
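One caveat worth flagging: np.argmax(f1_test) picks the pruning level using the test set, so the reported test scores are mildly optimistic. A more conservative variant, sketched below on synthetic data (a hypothetical stand-in, not the project's actual train/test split), cross-validates ccp_alpha on the training data only and touches the test set just once at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Candidate alphas come from the training data only
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(path.ccp_alphas)[:-1]  # drop the alpha that prunes everything

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"ccp_alpha": alphas},
    scoring="f1",
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_["ccp_alpha"])  # chosen without looking at X_te / y_te
```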
confusion_matrix_sklearn(best_model, X_train, y_train) # create confusion matrix
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)  # performance of decision tree on train data
decision_tree_post_perf_train  # show results
|  | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.89954 | 0.90303 | 0.81274 | 0.85551 |
confusion_matrix_sklearn(best_model, X_test, y_test) # create confusion matrix
decision_tree_post_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)  # performance of decision tree on test data
decision_tree_post_test # show results after the pruning
|  | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.86879 | 0.85576 | 0.76614 | 0.80848 |
feature_names = list(X_train.columns)
importances = best_model.feature_importances_
indices = np.argsort(importances)
# parameters to show the post-pruned decision tree diagram
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the post-pruned decision tree
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
# showing relative importances of features
importances = best_model.feature_importances_
indices = np.argsort(importances)
# graph parameters
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
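Impurity-based importances like the ones plotted above are computed from the training splits alone and can overstate features the tree happens to split on often. As a complementary check, scikit-learn's permutation_importance measures how much a chosen score drops when a feature's values are shuffled. A minimal sketch on synthetic data (not the hotel dataset):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
model = DecisionTreeClassifier(random_state=1, ccp_alpha=0.01).fit(X, y)

# Mean F1 drop over 10 shuffles of each feature column
result = permutation_importance(model, X, y, scoring="f1", n_repeats=10, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```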
# training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
|  | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.99421 | 0.83097 | 0.89954 |
| Recall | 0.98661 | 0.78608 | 0.90303 |
| Precision | 0.99578 | 0.72425 | 0.81274 |
| F1 | 0.99117 | 0.75390 | 0.85551 |
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
|  | Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|
| Accuracy | 0.87118 | 0.86621 | 0.83497 | 0.86879 |
| Recall | 0.81175 | 0.80494 | 0.78336 | 0.85576 |
| Precision | 0.79461 | 0.78663 | 0.72758 | 0.76614 |
| F1 | 0.80309 | 0.79568 | 0.75444 | 0.80848 |